-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support conversion to/from cudf in dask.dataframe.core.to_backend #12380
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor comment.
Is this critical? We're entering an already-delayed code freeze tonight, so it would be better to push this to 23.04. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-23.04 #12380 +/- ##
===============================================
Coverage ? 85.85%
===============================================
Files ? 158
Lines ? 25204
Branches ? 0
===============================================
Hits ? 21638
Misses ? 3566
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, although I think that this is probably safe, it doesn't really feel like the right kind of change we should be attempting to get in in code freeze; so I'm with @vyasr here that I think this is better to slip to 23.04.
if isinstance(data._meta, (cudf.DataFrame, cudf.Series, cudf.Index)): | ||
# Already a cudf-backed collection | ||
return data | ||
return data.map_partitions(cls.to_backend_dispatch(), **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this is a question for the dask side of the implementation, but under what circumstances can kwargs
be non-empty (and then relatedly, are there circumstances in which we already have a cudf-backed collection where non-empty kwargs could semantically require something different than data.map_partitions(identity)
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question - I updated things a bit to be explicit about supported/unsupported key-word arguments. I also added a bit more test coverage.
Context: We (probably) do want to make it possible to do things like ddf.to_backend("cudf", nan_as_null=False)
. However, I'm still unsure of the best way to document the supported to_backend
kwargs for the various conversions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, can we link to the registered backend dispatch function which could explain what it accepts as keyword arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ll give this a think, but not sure how/where to link to the dispatch-function args, since the available arguments to the user-facing API (to_backend) will depend on the specified backend. Any ideas?
Thanks for the review @wence- ! Agree with you both that this should go in 23.04 - The Dask documentation can be updated pretty easily to specify that |
/merge |
Depends on dask/dask#9758
Description
This PR corresponds to the
cudf
component of dask/dask#9758Checklist